Near-miss Modeling: a Segment-based Approach to Speech Recognition
Authors
Abstract
Currently, most approaches to speech recognition are frame-based in that they represent speech as a temporal sequence of feature vectors. Although these approaches have been successful, they cannot easily incorporate complex modeling strategies that may further improve speech recognition performance. In contrast, segment-based approaches represent speech as a temporal graph of feature vectors and facilitate the incorporation of a wide range of modeling strategies. However, difficulties in segment-based recognition have impeded the realization of potential advantages in modeling. This thesis describes an approach called near-miss modeling that addresses the major difficulties in segment-based recognition. Probabilistically, each path should account for the entire graph, including the segments that are off the path as well as the segments that are on the path. Near-miss modeling is based on the idea that an off-path segment can be modeled as a "near-miss" of an on-path segment. Each segment is associated with a near-miss subset of segments that contains the on-path segment as well as zero or more off-path segments, such that the near-miss subsets that are associated with any path account for the entire graph. Computationally, the graph should contain only a small number of segments without introducing a large number of segmentation errors. Near-miss modeling runs a recognizer and produces a graph that contains only the segments on paths that score within a threshold of the best scoring path. A near-miss recognizer using context-independent segment-based acoustic models, diphone context-dependent frame-based models, and a phone bigram language model achieves a 25.5% error rate on the TIMIT core test set over 39 classes. This is a 16% reduction in error rate from our best previously reported result and, to our knowledge, is the lowest error rate that has been reported under comparable conditions.
Additional experiments using the ATIS corpus verify that these improvements generalize to word recognition.
Acknowledgments
First, I thank Jim for the guidance and encouragement he has given me over the past four years and especially in the last half year of writing. I also thank Victor and Paul for their insightful suggestions that have strengthened this thesis. In addition, I and TJ for all of their helpful comments. I am grateful to Victor and all of the members of SLS who have provided the inspiring research environment in which I have spent these past six years. I am sure that I will miss this environment whenever it is that I finally …
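The abstract's pruning idea, keeping only the segments that lie on paths scoring within a threshold of the best scoring path, can be illustrated with a small sketch. This is not the thesis's implementation: it assumes that segments are edges between integer time boundaries, that each segment carries an additive log score (higher is better), and that a path's score is the sum of its segment scores. The function name `prune_segment_graph` and its parameters are hypothetical.

```python
from collections import defaultdict

NEG_INF = float("-inf")


def prune_segment_graph(segments, start, end, threshold):
    """Keep segments on at least one path scoring within `threshold` of the best.

    segments: list of (t_begin, t_end, log_score) with t_begin < t_end
              (assumed representation, not taken from the thesis).
    start, end: first and last time boundaries of the utterance.
    threshold: non-negative margin below the best path score.
    """
    # forward[t]: best score of any partial path from `start` to boundary t.
    forward = defaultdict(lambda: NEG_INF)
    forward[start] = 0.0
    # Sorting by begin time guarantees forward[b] is final before it is used,
    # because every segment ending at b begins strictly earlier.
    for b, e, s in sorted(segments, key=lambda seg: seg[0]):
        if forward[b] + s > forward[e]:
            forward[e] = forward[b] + s

    # backward[t]: best score of any partial path from boundary t to `end`.
    backward = defaultdict(lambda: NEG_INF)
    backward[end] = 0.0
    for b, e, s in sorted(segments, key=lambda seg: seg[1], reverse=True):
        if s + backward[e] > backward[b]:
            backward[b] = s + backward[e]

    best = forward[end]
    # The best complete path through a segment is forward + score + backward.
    return [
        (b, e, s)
        for b, e, s in segments
        if forward[b] + s + backward[e] >= best - threshold
    ]
```

A larger threshold keeps more alternative segmentations in the graph at a higher computational cost; a smaller one keeps the graph compact but risks discarding the correct segmentation.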
Similar resources
Near-miss modeling: a segment-based approach to speech recognition
Currently, most approaches to speech recognition are frame-based in that they represent speech as a temporal sequence of feature vectors. Although these approaches have been successful, they cannot easily incorporate complex modeling strategies that may further improve speech recognition performance. In contrast, segment-based approaches represent speech as a temporal graph of feature vectors a...
Improved Bayesian Training for Context-Dependent Modeling in Continuous Persian Speech Recognition
Context-dependent modeling is a widely used technique for better phone modeling in continuous speech recognition. While different types of context-dependent models have been used, triphones have been known as the most effective ones. In this paper, a Maximum a Posteriori (MAP) estimation approach has been used to estimate the parameters of the untied triphone model set used in data-driven clust...
Segmentation and Modeling in Segment-based Recognition
Recently, we have developed a probabilistic framework for segment-based speech recognition that represents the speech signal as a network of segments and associated feature vectors [2]. Although in general, each path through the network does not traverse all segments, we argued that each path must account for all feature vectors in the network. We then demonstrated an efficient search algorithm ...
Allophone-based acoustic modeling for Persian phoneme recognition
Phoneme recognition is one of the fundamental phases of automatic speech recognition. Coarticulation, which refers to the integration of sounds, is one of the important obstacles in phoneme recognition. In other words, each phone is influenced and changed by the characteristics of its neighbor phones, and coarticulation is responsible for most of these changes. The idea of modeling the effects o...
Speaker Adaptation in Continuous Speech Recognition Using MLLR-Based MAP Estimation
A variety of methods are used for speaker adaptation in speech recognition. In some techniques, such as MAP estimation, only the models with available training data are updated. Hence, large amounts of training data are required in order to have significant recognition improvements. In some others, such as MLLR, where several general transformations are applied to model clusters, the results ar...